In this paper we consider the estimation of population size from
one-source capture-recapture data, that is, a list in which individuals 
can potentially be found repeatedly and where the question is how 
many individuals are missed by the list. As a typical example, we 
provide data from a drug user study in Bangkok from 2001 where 
the list consists of drug users who repeatedly contact treatment in- 
stitutions. Drug users with 1, 2, 3, . . . contacts occur, but drug users 
with zero contacts are not present, requiring the size of this group to 
be estimated. Statistically, these data can be considered as stemming 
from a zero-truncated count distribution. We revisit an estimator for 
the population size suggested by Zelterman that is known to be ro- 
bust under potential unobserved heterogeneity. We demonstrate that 
the Zelterman estimator can be viewed as a maximum likelihood es- 
timator for a locally truncated Poisson likelihood which is equivalent 
to a binomial likelihood. This result allows the extension of the Zel- 
terman estimator by means of logistic regression to include observed 
heterogeneity in the form of covariates. We also review an estimator 
proposed by Chao and explain why we are not able to obtain similar 
results for this estimator. The Zelterman estimator is applied in two 
case studies, the first a drug user study from Bangkok, the second an 
illegal immigrant study in the Netherlands. Our results suggest the 
new estimator should be used, in particular, if substantial unobserved 
heterogeneity is present. 

1. Introduction. Registration files can be used to generate a list of indi- 
viduals from some population of interest. If each time that an observation 
of a population member occurs is registered but, for one reason or another, 
some population members are not observed at all, the list will be incomplete 
and will show only part of the population. In this paper we will further
develop a method proposed by Zelterman (1988) for estimating the size of a
population using an incomplete list.

Consider a population of size $N$ and a count variable $Y$ taking values in
the set of integers $\{0, 1, 2, 3, \ldots\}$. For example, in drug user studies
$Y$ might represent the number of contacts a drug user has with the treatment
institutions. Also denote with $f_0, f_1, f_2, \ldots$ the frequency with which
a $0, 1, 2, \ldots$ occurs in this population. Consider now a registration where
every contact with a treatment institution is registered and assume that a list
of drug users is derived from this registration. Since a drug user will only be
observed if there has been a positive number of contacts with the treatment
institution, $Y = 0$ will not be observed in the list. Hence, the list reflects
a count variable truncated at zero that we denote by $Y_+$. Accordingly, the
list has observed frequencies $f_1, f_2, \ldots$, but the frequency $f_0$ of
zeros in the population is unknown. The size of the list is not $N$ but $n$,
where $N = n + f_0$.

The distributions of the untruncated and truncated counts are connected
via $P(Y_+ = j) = P(Y = j)/\{1 - P(Y = 0)\}$ for $j = 1, 2, \ldots$. For example,
if $Y$ follows a Poisson distribution with parameter $\lambda$ so that

(1.1)  $$P(Y = j) = \mathrm{Po}(j \mid \lambda) = e^{-\lambda}\lambda^{j}/j!,$$

for $j = 0, 1, 2, \ldots$, then the associated distribution of $Y_+$ is given as

(1.2)  $$P(Y_+ = j) = \mathrm{Po}_+(j \mid \lambda) = \frac{e^{-\lambda}}{1 - e^{-\lambda}}\,\lambda^{j}/j!,$$

with $j = 1, 2, 3, \ldots$.

Given that all units of the population have the same probability
$P_i(Y > 0) = P(Y > 0) = 1 - P(Y = 0)$ of being included in the list, the
population size can be estimated by means of the Horvitz-Thompson estimator

(1.3)  $$\hat{N} = \sum_{i=1}^{n} \frac{1}{P_i(Y > 0)} = \frac{n}{1 - P(Y = 0)} = \frac{n}{1 - g(\lambda)},$$

where $g(\lambda) = e^{-\lambda}$, or more generally, $g(\lambda)$ is the
probability of a zero count for a given count distribution. For more details on
this type of capture-recapture methodology, see van der Heijden et al. (2003a),
van der Heijden, Cruyff and van Houwelingen (2003b), Bohning and Schon (2005),
Roberts and Brewer (2006) or McKendrick (1926) (for a historic account). General
introductions to capture-recapture are found in Bishop, Fienberg and Holland
(1975), Hook and Regal (1995) and the contributions of the International
Working Group for Disease Monitoring and Forecasting (1995a, 1995b).
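
Equations (1.1)-(1.3) can be sketched in a few lines of Python. This is only an illustration under the homogeneous Poisson assumption; the function names are hypothetical and are not taken from the supplementary macros, and the values n = 1000 and lam = 0.5 are made up for the example.

```python
import math

def poisson_pmf(j, lam):
    """Untruncated Poisson probability Po(j | lam), equation (1.1)."""
    return math.exp(-lam) * lam ** j / math.factorial(j)

def truncated_poisson_pmf(j, lam):
    """Zero-truncated Poisson probability Po+(j | lam), equation (1.2)."""
    if j < 1:
        raise ValueError("the zero-truncated distribution has support j = 1, 2, ...")
    return poisson_pmf(j, lam) / (1.0 - math.exp(-lam))

def horvitz_thompson(n, lam):
    """Population size n / (1 - g(lam)) with g(lam) = exp(-lam), equation (1.3)."""
    return n / (1.0 - math.exp(-lam))

# Illustrative values only (not data from the paper).
total = sum(truncated_poisson_pmf(j, 0.5) for j in range(1, 60))  # should be close to 1
n_hat = horvitz_thompson(1000, 0.5)
```

The truncated probabilities sum to one over j = 1, 2, ..., and the estimate grows quickly as lam becomes small, because few units are then expected to appear in the list.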

In what follows we further develop an estimator for $\lambda$ proposed by
Zelterman (1988), which can be used in (1.3) to obtain a population size
estimate. This estimator for $\lambda$ uses limited information from the
observed count distribution to arrive at an estimate of the population size,
making it robust. Our key extension to this estimator for $\lambda$ is to put it
into a maximum likelihood framework, which allows further development using a
regression framework. In Section 2 we review the Zelterman estimator, including
its robustness properties. In Section 3 we demonstrate that the Zelterman
estimator is a maximum likelihood estimator and use this result to estimate its
variance and generalize the estimator to accommodate covariates. Section 4
points out the connections to Chao's estimator. The paper concludes with a case
study section where we utilize examples from a Bangkok illicit drug user
study and a reanalysis of illegal immigrant data analyzed earlier by van der
Heijden et al. (2003a).

2. The Zelterman estimator. In equation (1.3) we used the Horvitz-Thompson
approach to arrive at an estimate of the population size. This approach
requires that $\lambda$ is known and if it is not, it needs to be estimated.
Clearly, $\lambda$ can be estimated with maximum likelihood under the assumption
of a homogeneous truncated Poisson distribution. Instead of estimating $\lambda$
under the assumption of a homogeneous Poisson distribution, Zelterman
(1988) argued that the Poisson assumption might not be valid over the entire
range of possible values for $Y$ but it might be valid for small ranges of $Y$
such as from $j$ to $j + 1$, so that it would be meaningful to use only the
frequencies $f_j$ and $f_{j+1}$ in estimating $\lambda$. Since for any $j$ both
the truncated as well as the untruncated Poisson distribution have the property
that $\mathrm{Po}(j + 1 \mid \lambda)/\mathrm{Po}(j \mid \lambda) = \lambda/(j + 1)$
and $\mathrm{Po}_+(j + 1 \mid \lambda)/\mathrm{Po}_+(j \mid \lambda) = \lambda/(j + 1)$,
respectively [see equations (1.1) and (1.2)], $\lambda$ can be written as

(2.1)  $$\lambda = \frac{(j + 1)\,\mathrm{Po}(j + 1 \mid \lambda)}{\mathrm{Po}(j \mid \lambda)} = \frac{(j + 1)\,\mathrm{Po}_+(j + 1 \mid \lambda)}{\mathrm{Po}_+(j \mid \lambda)}.$$

An estimator for $\lambda$ is obtained by replacing $\mathrm{Po}_+(j \mid \lambda)$
by the empirical frequency $f_j$:

(2.2)  $$\hat{\lambda}_j = \frac{(j + 1)\, f_{j+1}}{f_j}.$$

If $j = 1$, we find $\hat{\lambda}_1 = 2f_2/f_1$, and this estimator is often
considered for two reasons: for one, $\hat{\lambda}_1$ is using frequencies in
the vicinity of $f_0$ which is the target of prediction, and two, in many
application studies for estimating $f_0$ the majority of counts fall into $f_1$
and $f_2$. Clearly, the estimator is unaffected by changes in the data for
counts larger than 2, which contributes largely to its robustness. We will call
$\hat{\lambda}_1 = 2f_2/f_1$ the Zelterman estimator for $\lambda$ and, when this
estimate is used in (1.3), this leads to the Zelterman estimator of the
population size, $\hat{N}_Z$. If the context is clear, we will simply use the
term Zelterman estimator.
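
The local estimator (2.2) is short enough to write out directly. The sketch below is illustrative only: the function name is hypothetical, and the frequencies are the first three entries of Table 1 (counts of users with 1, 2 and 3 contacts).

```python
def local_lambda(freq, j=1):
    """Estimate lambda from f_j and f_{j+1} only: (j + 1) * f_{j+1} / f_j."""
    return (j + 1) * freq[j + 1] / freq[j]

# First three frequencies of Table 1 (y = 1, 2, 3).
freq = {1: 3114, 2: 163, 3: 23}
lam_1 = local_lambda(freq, 1)   # Zelterman estimator, uses f_1 and f_2 only
lam_2 = local_lambda(freq, 2)   # same construction one step higher, uses f_2 and f_3
```

Under a homogeneous Poisson model both values would estimate the same parameter. Here lam_1 is about 0.10 while lam_2 is about 0.42, which illustrates why restricting attention to the counts nearest to zero can matter.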

The Zelterman estimator is very simple to understand and to use, and this
might be one of the reasons why it is quite popular in applications such as
drug user studies [Hay and Smit (2003), Van Hest et al. (2007)]. It is also
thought of as being less sensitive to model violations than the estimator that
is derived under the assumption of the homogeneous Poisson distribution, which
uses the entire range of frequencies $f_j$. Indeed, the Zelterman estimator also
works rather well with contaminated distributions as given by mixtures or
approximated by mixtures [compare Zelterman (1988)]. We now look at a study to
illustrate the application of the estimator.

Table 1
Frequency distribution $f_y$ of Metamphetamine users with exactly $y$ repeated
contacts with treatment institutions

y    1 2 3 4 5 6 7 8 9 10 11 12
f_y  3114 163 23 20 9 3 3 3 4 3 1

Example: (Estimating the number of Metamphetamine users in Bangkok,
2001). Let us illustrate the estimator for a data set of users of
Metamphetamine in Bangkok [Bohning et al. (2004)]. The distribution of contact
counts with treatment institutions is provided in Table 1.

In total 3346 users were observed. We find
$\hat{\lambda}_1 = (2 \times 163)/3114 = 0.1047$ (with 95% CI 0.0894-0.1225)
and, using (1.3), this gives
$\hat{N} = 3346/(1 - \exp(-0.1047)) = 33{,}664$ (CI 28,520-38,808). The
observed/hidden ratio equals 3346/(33,664 - 3346) = 0.1104 and the completeness
is 3346/33,664 = 0.0994. Note that the maximum likelihood estimator derived
under the homogeneous Poisson assumption is $\hat{\lambda} = 0.2463$
(CI 0.2245-0.2703), leading to a population size estimate of
$\hat{N} = 3346/(1 - \exp(-0.2463)) = 15{,}325$ (CI 13,989-16,661), which
differs considerably from the Zelterman estimator of the population size. The
confidence intervals are based upon normal approximations using a variance
expression given in Section 3.1 below. Since it is reasonable to assume that
the counts stem from a contaminated distribution rather than from a homogeneous
distribution, the Zelterman estimate appears to be more reasonable. In
addition, the homogeneous Poisson estimate is biased downward if heterogeneity
is present [van der Heijden et al. (2003a), van der Heijden, Cruyff and van
Houwelingen (2003b), Bohning and Schon (2005)], so that a strong disagreement
between the homogeneous Poisson estimate and the Zelterman estimate might be
taken as an indication of a lack of fit for the homogeneous Poisson, as occurs
here. In such cases, the Zelterman estimate will be the better choice.
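
The point estimates of this example can be retraced with a short Python sketch. It only uses numbers quoted in the text (n = 3346, the counts of ones and twos, and the homogeneous Poisson estimate 0.2463); the variable names are illustrative.

```python
import math

n, f1, f2 = 3346, 3114, 163           # list size and the counts of ones and twos

lam_z = 2 * f2 / f1                   # Zelterman estimate of lambda
n_z = n / (1 - math.exp(-lam_z))      # Horvitz-Thompson step (1.3)
hidden = n_z - n                      # estimated number of unobserved users
ratio = n / hidden                    # observed/hidden ratio
completeness = n / n_z                # share of the population on the list

lam_mle = 0.2463                      # homogeneous Poisson MLE quoted in the text
n_mle = n / (1 - math.exp(-lam_mle))  # corresponding population size estimate
```

With the rounded value 0.2463 the last line lands within a few units of the quoted 15,325; the small difference is presumably due to rounding of the estimate of lambda.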

3. The Zelterman estimator as a maximum likelihood estimator. In this
section we will show that the Zelterman estimator is also a maximum likelihood
estimator. It is based upon the observation that a Poisson distribution
with parameter $\lambda$ constrained to values $Y = 1$ and $Y = 2$ yields a
binomial distribution with parameter
$p = (\lambda/2)/(1 + \lambda/2) = \lambda/(2 + \lambda)$. This result will
allow for a simple derivation of the variance (see Section 3.1), as well as an
extension of the Zelterman estimator that allows for covariates (see Section
3.2).

3.1. A likelihood for the Zelterman estimator. If we consider the
probability for a count of 1 and a count of 2 given as
$e^{-\lambda}\lambda/(e^{-\lambda}\lambda + e^{-\lambda}\lambda^2/2)$ and
$(e^{-\lambda}\lambda^2/2)/(e^{-\lambda}\lambda + e^{-\lambda}\lambda^2/2)$,
respectively, we see that after some simplification we have the likelihood

(3.1)  $$\left(\frac{1}{1 + \lambda/2}\right)^{f_1} \times \left(\frac{\lambda/2}{1 + \lambda/2}\right)^{f_2} = (1 - p)^{f_1}\, p^{f_2},$$

which is proportional to a binomial likelihood with event parameter
$p = \lambda/(2 + \lambda)$, the probability for $Y = 2$. This binomial
likelihood is maximized for $\hat{p} = f_2/(f_1 + f_2)$. In addition, as
$\lambda$ is connected uniquely to $p$ via $\lambda = 2p/(1 - p)$, the
invariance property of maximum likelihood estimators yields
$\hat{\lambda}_1 = 2f_2/f_1$ as a maximum likelihood estimate with respect to
the likelihood (3.1). We summarize this in the following theorem.

Theorem 1. Consider a Poisson count $Y$ where all counts are truncated
unless $Y = 1$ or $Y = 2$. Then:

(a) the associated likelihood is given by (3.1),

(b) the maximum likelihood estimator with respect to (3.1) is
$\hat{p} = f_2/(f_1 + f_2)$ or $\hat{\lambda}_1 = 2f_2/f_1$.
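
The maximization behind part (b) can be spelled out in one line. The following is the standard binomial calculation, written here with $\ell$ as an assumed symbol for the logarithm of the binomial likelihood of the counts of ones and twos; it is added for illustration and is not part of the original argument.

```latex
\ell(p) = f_1 \log(1 - p) + f_2 \log p, \qquad
\ell'(p) = -\frac{f_1}{1 - p} + \frac{f_2}{p} = 0
\;\Longrightarrow\;
\hat{p} = \frac{f_2}{f_1 + f_2}, \qquad
\hat{\lambda}_1 = \frac{2\hat{p}}{1 - \hat{p}}
  = \frac{2 f_2/(f_1 + f_2)}{f_1/(f_1 + f_2)}
  = \frac{2 f_2}{f_1}.
```

Since $\ell''(p) < 0$ on $(0, 1)$, the stationary point is the unique maximum whenever both counts are positive.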

One of the first benefits of identifying the Zelterman estimator
$\hat{\lambda}_1$ as a truncated maximum likelihood estimator is the fact that
its variance is readily available as
$\mathrm{Var}(\hat{p}) = p(1 - p)/(f_1 + f_2)$, which can be estimated as
$f_2 f_1/(f_1 + f_2)^3$. Now, $\hat{\lambda}_1 = 2\hat{p}/(1 - \hat{p})$, and
using a first order $\delta$-method,

$$\mathrm{Var}\log(\hat{\lambda}_1) = \mathrm{Var}(\log\hat{p} - \log(1 - \hat{p})) \approx \frac{1}{p(1 - p)(f_1 + f_2)}$$

and, finally, plugging in an estimate for $p$, we arrive at

(3.2)  $$\mathrm{Var}\log(\hat{\lambda}_1) \approx \left(\frac{f_1}{f_1 + f_2}\,\frac{f_2}{f_1 + f_2}\,(f_1 + f_2)\right)^{-1} = \frac{1}{f_1} + \frac{1}{f_2},$$

leading to a simple closed form expression for the variance of the logarithm
of the Zelterman estimator. In addition, using a first order $\delta$-method, we
have that $\mathrm{Var}\log\hat{\lambda}_1 \approx \frac{1}{\lambda^2}\mathrm{Var}\,\hat{\lambda}_1$,
which can be rephrased as

$$\mathrm{Var}\,\hat{\lambda}_1 \approx \lambda^2\,\mathrm{Var}\log\hat{\lambda}_1.$$

Plugging in the Zelterman estimate for $\lambda$ leads to the result (b) in the
following theorem.

Theorem 2. Consider a situation as in Theorem 1. Then:

(a) $\mathrm{Var}\log(\hat{\lambda}_1) \approx \frac{1}{f_1} + \frac{1}{f_2}$,

(b) $\mathrm{Var}(\hat{\lambda}_1) \approx \frac{4 f_2 (f_1 + f_2)}{f_1^3}$.
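
As a small numerical illustration of Theorem 2, the sketch below computes both variance expressions for the Bangkok counts of ones and twos (3114 and 163). Forming the interval on the log scale and transforming back is an assumption of this sketch; it happens to reproduce the interval 0.0894-0.1225 quoted in Section 2. The function name is hypothetical.

```python
import math

def zelterman_lambda_ci(f1, f2, z=1.96):
    """Zelterman estimate of lambda with variance and a log-scale normal interval."""
    lam = 2 * f2 / f1
    var_log = 1 / f1 + 1 / f2                 # Theorem 2(a)
    var_lam = 4 * f2 * (f1 + f2) / f1 ** 3    # Theorem 2(b)
    half = z * math.sqrt(var_log)             # half-width on the log scale
    return lam, var_lam, (lam * math.exp(-half), lam * math.exp(half))

lam, var_lam, (lower, upper) = zelterman_lambda_ci(3114, 163)
```

Part (b) is simply part (a) multiplied by the squared estimate, so the two expressions carry the same information on different scales.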

3.2. Extension of the likelihood for covariates. A second benefit of
identifying the Zelterman estimator as a truncated maximum likelihood estimator
is that it is now easy to incorporate covariates into the modeling process.
Let $Z$ be a binary indicator variable indicating $Z = 1$ if $Y = 2$ and
$Z = 0$ if $Y = 1$. Then, the likelihood (3.1) can be written as

$$\prod_{i=1}^{f_1 + f_2} (1 - p)^{1 - z_i}\, p^{z_i}.$$

Suppose that covariates are available in the form of a vector $x_i$ for the
$i$th unit in the list. In a generalized linear model (logistic regression
model) connecting the binary outcome probability $p_i$ with the linear
predictor $\eta_i = \beta^T x_i$ with a logit link, we have that

$$p_i = \frac{e^{\eta_i}}{1 + e^{\eta_i}}.$$

On the other hand, $p_i$ and the local Poisson parameter $\lambda_i$ are
connected via

$$p_i = \frac{\lambda_i/2}{1 + \lambda_i/2},$$

so that $\lambda_i$ and the linear predictor $\eta_i$ are simply connected via
$\lambda_i/2 = e^{\eta_i}$ or $\lambda_i = 2e^{\eta_i}$. Note that the binary
response probability $P(Z_i = 1) = p_i$ is connected to the linear predictor
$\eta_i$ via the logistic link function, whereas the Poisson mean
$\lambda_i = 2e^{\eta_i}$ uses the log link function, that is, both are
generalized linear models using the canonical link functions.

Maximum likelihood estimation can use existing tools for logistic
regression. All that is needed is to regress the binary outcomes
$z_1, \ldots, z_{f_1 + f_2}$ on $x_i$ to find the MLE $\hat{\beta}$ of $\beta$.
This provides the predicted probabilities
$\hat{p}_i = e^{\hat{\eta}_i}/(1 + e^{\hat{\eta}_i})$ and the Zelterman
estimates of parameters $\lambda_i$ are obtained as
$2\hat{p}_i/(1 - \hat{p}_i)$.

In order to derive a generalized Zelterman estimator of the population size
$N$ under this framework, we can employ the Horvitz-Thompson approach
in the following way:

(3.4)  $$\hat{N}_Z = \sum_{i=1}^{n} \frac{1}{1 - \exp(-\hat{\lambda}_i)} = \sum_{i=1}^{n} \frac{1}{1 - \exp(-2\hat{p}_i/(1 - \hat{p}_i))} = \sum_{i=1}^{n} \frac{1}{1 - \exp(-2e^{\hat{\eta}_i})}.$$
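
The two steps above (a logistic regression on the units with one or two counts, followed by the Horvitz-Thompson sum over all observed units) can be sketched in plain Python for a single covariate. This is a minimal illustration, not the authors' STATA or GAUSS macros: the Newton-Raphson routine and all function names are assumptions of this sketch, and the data are the gender rows of Table 5 (female: 366 ones, 24 twos, 398 observed; male: 1279 ones, 159 twos, 1482 observed).

```python
import math

def fit_logit(groups, iterations=25):
    """Newton-Raphson for logit(p) = b0 + b1 * x on grouped binary data.

    groups holds tuples (x, f1, f2): covariate value, number of units with
    exactly one count (Z = 0) and with exactly two counts (Z = 1).
    """
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, f1, f2 in groups:
            m = f1 + f2
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            resid = f2 - m * p                 # score contribution
            weight = m * p * (1.0 - p)         # information contribution
            g0 += resid
            g1 += resid * x
            h00 += weight
            h01 += weight * x
            h11 += weight * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det      # solve the 2 x 2 Newton system
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

def zelterman_regression_size(units, b0, b1):
    """Generalized Zelterman estimate (3.4); units holds (x, n_x) pairs."""
    total = 0.0
    for x, n_x in units:
        lam = 2.0 * math.exp(b0 + b1 * x)      # lambda_i = 2 exp(eta_i)
        total += n_x / (1.0 - math.exp(-lam))
    return total

# Gender rows of Table 5: x = 0 female, x = 1 male.
b0, b1 = fit_logit([(0, 366, 24), (1, 1279, 159)])
n_hat = zelterman_regression_size([(0, 398), (1, 1482)], b0, b1)
```

Note that the regression is fitted only on units with one or two counts, whereas the final sum runs over all observed units, including those with more than two counts. For these data the result is about 9970, in line with the Zelterman regression entry for model G in Table 7.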

In addition, it is possible to find an estimate of the variance of the
generalized Zelterman estimator (3.4) which we write as

$$\hat{N}_Z = \sum_{i=1}^{n} \frac{1}{w_i} = \sum_{i=1}^{N} \frac{\Delta_i}{w_i},$$

where $w_i = 1 - \exp(-2e^{\eta_i})$ and $\Delta_i$ is an indicator which is 1
(unit is sampled) with probability $w_i$ and 0 (unit is not sampled) with
probability $1 - w_i$. Note that $w_i = 1 - \exp(-2e^{\hat{\eta}_i})$ is not
fixed, but a random quantity itself. This excludes the direct application of
known variance formulas for the Horvitz-Thompson estimator and their variations
such as Sen-Yates-Grundy [for details, see Thompson (2002), pages 54-55].
Variance estimation of the Horvitz-Thompson estimator with estimated $w_i$
(which might no longer be called the Horvitz-Thompson estimator) needs to take
into account the variability in estimating the linear predictor
$\hat{\eta}_i$. This problem was first pointed out by Huggins (1989). To
accomplish the task, we use the techniques of conditional moments [see Ross
(1985), page 125] and results from van der Heijden et al. (2003a). Details are
in the Appendix. We state here only the final variance approximation:

(3.5)  $$\mathrm{Var}(\hat{N}_Z) \approx \sum_{i=1}^{n} \frac{1 - w_i}{w_i^2} + \left(\sum_{i=1}^{n} \frac{\nabla w_i(\hat{\beta})}{w_i^2}\right)^{T} \mathrm{Cov}(\hat{\beta}) \left(\sum_{i=1}^{n} \frac{\nabla w_i(\hat{\beta})}{w_i^2}\right),$$

where $w_i = 1 - \exp(v_i)$ and $v_i = -2e^{\eta_i}$, so that
$w_i = w_i(\beta) = 1 - \exp(v_i) = 1 - \exp(-2e^{\eta_i}) = 1 - \exp(-2e^{\beta^T x_i})$.
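
For the special case without covariates the variance approximation reduces to two scalar terms: a binomial term n(1 - w)/w^2 for the sampling of units, and a delta-method term for the estimation of the single linear predictor. The sketch below works under the assumptions that the linear predictor is the intercept log(f2/f1) and that its variance is 1/f1 + 1/f2 (Theorem 2(a), since the intercept equals the logarithm of the estimate of lambda minus log 2). The function name is hypothetical. Applied to the Bangkok data it reproduces the interval 28,520-38,808 quoted in Section 2.

```python
import math

def zelterman_size_ci(n, f1, f2, z=1.96):
    """Intercept-only Zelterman estimate of N with a normal-approximation interval."""
    eta = math.log(f2 / f1)                  # intercept: lambda / 2 = exp(eta)
    v = -2 * math.exp(eta)                   # v = -lambda
    w = 1 - math.exp(v)                      # probability of being observed
    n_hat = n / w
    grad_w = -v * math.exp(v)                # derivative of w with respect to eta
    var_eta = 1 / f1 + 1 / f2                # variance of the estimated intercept
    term1 = n * (1 - w) / w ** 2             # variability from sampling the units
    term2 = (n * grad_w / w ** 2) ** 2 * var_eta   # variability from estimating eta
    se = math.sqrt(term1 + term2)
    return n_hat, (n_hat - z * se, n_hat + z * se)

n_hat, (lower, upper) = zelterman_size_ci(3346, 3114, 163)
```

In this example the second term dominates the first by a factor of roughly twenty, so ignoring the uncertainty in the estimated linear predictor would give a much too narrow interval.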



4. The connection to Chao's estimator. In this section we point out some
connections to another population size estimator proposed by Chao (1987,
1989) that also uses only the counts $f_1$ and $f_2$. We provide these results
because generalizing this estimator into a maximum likelihood framework was
less successful. Chao suggested the estimator
$\hat{N}_C = n + f_1^2/(2f_2)$. The estimator is based upon the Cauchy-Schwarz
inequality [see also Wilson and Collins (1992)] for the nonparametric mixture
of a Poisson, namely,

$$\left(\int_0^\infty \lambda e^{-\lambda}\,dQ(\lambda)\right)^2 \le \int_0^\infty e^{-\lambda}\,dQ(\lambda) \int_0^\infty \lambda^2 e^{-\lambda}\,dQ(\lambda),$$

where $Q$ denotes the mixing distribution of $\lambda$ and the Cauchy-Schwarz
inequality $(\int uv)^2 \le (\int u^2)(\int v^2)$ is used with
$u(\lambda) = \sqrt{e^{-\lambda}}$ and $v(\lambda) = \lambda\sqrt{e^{-\lambda}}$,
leading to $p_1^2 \le p_0 \times 2p_2$, so that $f_1^2/(2f_2)$ estimates a lower
bound for $f_0$. Chao (1987, 1989) suggests using this bound as an estimator if
higher frequency counts are small. It is mentioned frequently in the applied
statistical literature [see, e.g., Smit, Reinking and Reijerse (2002)] that the
Zelterman estimator and Chao's estimator are often quite close. Indeed, if we
compute the Chao estimator in our drug user example, we find that
$\hat{N}_C = 3346 + 3114^2/(2 \times 163) = 33{,}091$ (95 percent CI is
28,058-38,124), which is not far from $\hat{N}_Z = 33{,}664$. Furthermore, it is
often claimed that the Zelterman estimator is usually larger than Chao's
estimator, as it is in our example here. Hence, it is interesting to investigate
the relationship between the two estimators more theoretically.
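
The comparison in the drug user example can be retraced with the following Python sketch; the function names are illustrative.

```python
import math

def chao_estimator(n, f1, f2):
    """Chao's lower-bound estimator n + f1^2 / (2 * f2)."""
    return n + f1 ** 2 / (2 * f2)

def zelterman_estimator(n, f1, f2):
    """Zelterman estimator n / (1 - exp(-2 * f2 / f1))."""
    return n / (1 - math.exp(-2 * f2 / f1))

n_c = chao_estimator(3346, 3114, 163)
n_z = zelterman_estimator(3346, 3114, 163)
relative_gap = (n_z - n_c) / n_z    # Zelterman exceeds Chao by under 2% here
```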

The Zelterman estimator and the Chao estimator are connected as follows.
Let us write the Zelterman estimate for $f_0$ as

$$n\,\frac{\exp(-\hat{\lambda})}{1 - \exp(-\hat{\lambda})} = \frac{n}{\exp(\hat{\lambda}) - 1} \approx \frac{n}{\hat{\lambda} + \frac{1}{2}\hat{\lambda}^2},$$

using the first three terms of the Maclaurin series for the exponential
function: $\exp(x) = 1 + x + \frac{1}{2}x^2 + \cdots$. This can be further
written as

$$\frac{n}{\hat{\lambda} + \frac{1}{2}\hat{\lambda}^2} = \frac{f_1^2}{2f_2}\,\frac{n}{f_1 + f_2} \approx \frac{f_1^2}{2f_2},$$

the latter being Chao's lower bound estimate of $f_0$. Two statements follow
now easily from this representation and are summarized in Theorem 3 below.
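
A quick numerical check of the two approximation steps, using the Bangkok counts, may help to see where the estimators part ways. The variable names are illustrative.

```python
import math

n, f1, f2 = 3346, 3114, 163
lam = 2 * f2 / f1

f0_zelterman = n / (math.exp(lam) - 1)      # exact Zelterman estimate of f_0
f0_series = n / (lam + 0.5 * lam ** 2)      # after truncating the exponential series
f0_chao = f1 ** 2 / (2 * f2)                # Chao's lower bound estimate of f_0

# Algebraic identity used in the text: the truncated-series value equals
# Chao's estimate multiplied by n / (f1 + f2).
identity_rhs = f0_chao * n / (f1 + f2)
```

Truncating the series slightly increases the estimate, while dropping the factor n/(f1 + f2), which is about 1.02 here, decreases it. In this data set the second effect is the larger one, so Chao's estimate ends up below Zelterman's.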

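Theorem 3. Consider a situation as in Theorem 1:

(a) Assume that $n/(f_1 + f_2) \ge 1$. Then, for any $\epsilon > 0$ there
exists $\delta > 0$ such that if $f_2/f_1 < \delta$, then
$\hat{N}_C \le \hat{N}_Z + \epsilon$.

(b) If $n/(f_1 + f_2) = 1$, then $\hat{N}_C \ge \hat{N}_Z$ and
$\hat{N}_Z - \hat{N}_C = O(\hat{\lambda}^3)$.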

Zelterman's estimator is not always larger than Chao's. Note that statement
(b) gives a condition which leads to Chao's estimator being larger than
the one of Zelterman. Statement (b) follows from the fact that
$n/[\exp(\hat{\lambda}) - 1] \le n/(\hat{\lambda} + \frac{1}{2}\hat{\lambda}^2)$
for any nonnegative $\hat{\lambda}$. The term
$n/(\hat{\lambda} + \frac{1}{2}\hat{\lambda}^2)$ simplifies to
$f_1^2/(2f_2)\,[n/(f_1 + f_2)] = f_1^2/(2f_2)$ and the result follows. The
second part of statement (b) follows from the fact that

$$\exp(\hat{\lambda}) - 1 = \sum_{j=1}^{\infty} \hat{\lambda}^j/j! = \hat{\lambda} + \hat{\lambda}^2/2 + \hat{\lambda}^3(1/3! + \hat{\lambda}/4! + \cdots),$$

where the left-hand side corresponds to the Zelterman estimator and the
first two terms of the right-hand side correspond to the Chao estimator.
This ends the proof.

Note the difference between statements (a) and (b) in Theorem 3. Statement
(b) says that the estimators of Chao and Zelterman are close, with
Chao's estimator larger than the one of Zelterman if the ratio of the count
of twos to the count of ones is small and the proportion of both of them
among all observations is close to one. Statement (a) says that the estimator
of Chao is bounded above by the estimator of Zelterman (but they need not
be close) if the ratio of the count of twos to the count of ones is small.

Some elementary calculations show that $\hat{N}_C = n + f_1^2/(2f_2)$ also
satisfies

(4.1)  $$\hat{N}_C = \frac{n}{1 - \hat{p}_1^2/(2\hat{p}_2)} = \frac{n}{1 - f_1^2/(2f_2\hat{N}_C)},$$

where $\hat{p}_j = f_j/\hat{N}_C$ for $j = 1, 2$. Unfortunately, (4.1) contains
$\hat{N}_C$ on both sides of the equation, which causes difficulties when we
aim to generalize this for data with covariate information. More details on
this aspect of Chao's estimator are available from the authors upon request.
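
That Chao's estimator satisfies the fixed-point relation (4.1) can be checked numerically, here with the Bangkok counts; the variable names are illustrative.

```python
n, f1, f2 = 3346, 3114, 163
n_c = n + f1 ** 2 / (2 * f2)                 # Chao's estimator

# Right-hand sides of (4.1), evaluated at the estimator itself.
p1, p2 = f1 / n_c, f2 / n_c
rhs_probabilities = n / (1 - p1 ** 2 / (2 * p2))
rhs_frequencies = n / (1 - f1 ** 2 / (2 * f2 * n_c))
```

Both right-hand sides return the value of n_c, which illustrates the difficulty mentioned above: the inclusion probability in the Horvitz-Thompson form depends on the very quantity being estimated.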



5. Examples. 



5.1. The Bangkok drug users study example. We will illustrate the gen- 
eralized Zelterman approach using the Bangkok drug users study [Bohning 
et al. (2004)] introduced in Section 2. Let us consider the female drug users 
only. Tables 2 and 3 show the distribution of contact counts to treatment in- 
stitutions by age for Metamphetamine and Heroin users respectively. These 
are very different subpopulations of the drug user population in the Bangkok 
metropolis, as indicated by the quite different age distributions. Clearly, the 
age distribution of the Metamphetamine users is younger than the age dis- 
tribution of the Heroin users (see Tables 2 and 3). To analyze these data, 
STATA and GAUSS macros are available in the supplemental articles Bohning 
and van der Heijden (2008a, 2008b). The results of the analysis are provided 
in Table 4. None of the subpopulations seems to be affected by age as fol- 
lows from a likelihood ratio test. Accordingly, the population size estimates, 
unadjusted and adjusted for age, do not differ much. Whereas for the fe- 
male Heroin user population a completeness of identification of about 50% 
is reached (268/504), the completeness of identification is less than 10% for 
the Metamphetamine users (274/3714). 
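
The unadjusted estimates can be retraced from the "All" rows of Tables 2 and 3 with a short Python sketch; the function name is illustrative and only the totals n, f_1 and f_2 of each table are used.

```python
import math

def zelterman_size(n, f1, f2):
    """Zelterman population size estimate without covariates."""
    lam = 2 * f2 / f1
    return n / (1 - math.exp(-lam))

# (n, f1, f2) taken from the "All" rows of Tables 2 and 3.
groups = {"Heroin": (268, 116, 44), "Metamphetamine": (274, 261, 10)}
estimates = {drug: zelterman_size(*counts) for drug, counts in groups.items()}
completeness = {drug: groups[drug][0] / estimates[drug] for drug in groups}
```

The very small count of twos among the Metamphetamine users (10 against 261 ones) is what drives both the large estimate and its wide confidence interval.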

Table 2
Distribution of repeated contact counts y to treatment institutions
of female Metamphetamine users by age

        # users with y contacts:
Age      1     2     3     4    All
13       3     -     -     -      3
14       5     -     -     -      5
15      23     -     -     -     23
16      18     1     -     -     19
17      19     1     -     -     20
18      21     1     1     -     23
19      23     1     -     -     24
20      23     -     -     -     23
21      17     -     1     -     18
22      22     1     -     -     23
23      10     2     -     -     12
24      15     -     -     -     15
25      13     2     -     -     15
26      12     -     -     -     12
27       6     -     -     -      6
28       4     -     -     -      4
29       4     -     -     -      4
30       5     -     -     -      5
31       4     -     -     -      4
32       1     -     -     -      1
33       1     1     -     -      2
34       2     -     -     -      2
35       2     -     -     -      2
36       3     -     -     1      4
37       3     -     -     -      3
38       1     -     -     -      1
39       1     -     -     -      1
All    261    10     2     1    274

5.2. The illegal immigrants study. As a second example, we discuss the
estimation of the number of illegal immigrants in four large cities in the
Netherlands from police records, analyzed with the truncated Poisson
regression model by van der Heijden et al. (2003a). In their analysis van der
Heijden et al. focus on those illegal immigrants that, once apprehended,
cannot be effectively expelled by the police because, for example, their home
country does not cooperate with the organization of deportation. In such
cases the police request the individuals to leave the country, but it is
unlikely that they will abide by such a request. Hence, they can be apprehended
multiple times. The data contain four covariates: gender, age, home country
and reason for being arrested (or rearrested). For details about the data we
refer to van der Heijden et al. (2003a). The observed frequencies for the
covariate categories can be found in Table 5 and are reproduced from van der
Heijden et al. (2003a). The data are provided in a supplemental file [Bohning
and van der Heijden (2008c)].

Table 3
Distribution of repeated contact counts y to treatment institutions
of female Heroin users by age

        # users with y contacts:
Age      1     2     3     4    All
16       -     1     -     -      1
17       1     -     -     -      1
18       3     -     -     1      4
19       1     1     1     2      5
20       -     3     2     2      7
21       6     -     -     7     13
22       3     5     1     5     14
23      10     3     2     9     24
24      11     4     4     9     28
25       8     -     1     2     11
26      13     4     3     4     24
27       6     -     1     7     14
28       4     1     2     3     10
29       4     3     1     2     10
30       -     2     1     2      5
31       3     1     2     3      9
32       4     1     -     1      6
33       6     1     3     1     11
34       2     2     -     3      7
35       2     -     1     -      3
36       2     1     -     3      6
37       3     3     1     1      8
38       3     1     1     2      7
39       -     2     -     -      2
40       4     2     1     -      7
41       1     2     1     1      5
42       4     -     -     1      5
43       2     -     -     1      3
44       2     -     2     1      5
45       1     -     -     1      2
46       -     -     -     1      1
47       2     1     -     -      3
48       1     -     1     -      2
49       1     -     -     1      2
52       1     -     -     -      1
58       1     -     -     -      1
62       1     -     -     -      1
All    116    44    32    76    268

Table 4
Estimated population size of female drug users in Bangkok with 95% confidence
interval without and with adjustment for age of drug user, and logistic
log-likelihood LL

Drug             Covariates   N_Z (95% CI)        LL
Heroin           None         504 (389-628)       -94.11
                 AGE          505 (379-630)       -93.86
Metamphetamine   None         3714 (1417-6011)    -42.81
                 AGE          3772 (1376-6169)    -42.72

Table 5
Illegal immigrants not effectively expelled. Observed frequencies for
covariate categories

Covariate category    f_1   f_2   f_3   f_4   f_5   f_6   Total
>40 years             105     6     -     -     -     -     111
<40 years            1540   177    37    13     1     1    1769
Female                366    24     6     1     1     -     398
Male                 1279   159    31    12     -     1    1482
Turkey                 90     3     -     -     -     -      93
North Africa          838   146    28     9     1     1    1023
Rest Africa           229    11     3     -     -     -     243
Surinam                63     1     -     -     -     -      64
Asia                  272     9     1     2     -     -     284
America, Australia    153    13     5     2     -     -     173
Being illegal         224    29     5     1     -     -     259
Other reason         1421   154    32    12     1     1    1621

In Table 6 we provide the estimates of both the truncated Poisson regression
model as well as the Zelterman regression model. Both models provide
similar point estimates, but the estimated standard errors of the Zelterman
regression model are somewhat larger than those of the truncated Poisson
regression model, yielding fewer parameter estimates in the Zelterman
regression model deviating significantly from zero.

In Table 7 we present the population size estimates for a series of models. 
The top panel has been reproduced from van der Heijden et al. (2003a). It 
shows that the truncated Poisson regression model with covariates Gender, 
Age and Nation provides the best fitting main effects model both in terms of 
deviance as well as AIC, and when these three variables are included Reason 
does not increase the fit significantly. The population size estimate is 12,690 
with a CI of (7186-18,194). 

Interestingly, the top panel provides for each model a Lagrange multiplier
test [Gurmu (1991)] that can be used to test for overdispersion in the
zero-truncated Poisson regression model as a result of unobserved
heterogeneity. This test compares the model fit of the Poisson model with
alternative models with an extra dispersion parameter included, such as the
negative binomial regression model. Van der Heijden et al. (2003a) and Bohning
and Schon (2005) show that, if there is evidence for unobserved heterogeneity
in a model, the population size estimate will underestimate the true population
size [see also Bohning and Kuhnert (2006)]. For the illegal immigrant data
this appears to be the case for every model in the top panel of Table 7.



Table 6
Truncated Poisson regression model (columns 1 and 2) and Zelterman regression
model (columns 3 and 4) fit to the illegal immigrants data

Regression parameters                    MLE      SE      MLE-Z    SE-Z
Intercept                               -2.317   0.449   -3.359   0.528
Gender (male = 1, female = 0)            0.397   0.163    0.535   0.232
Age (<40 yrs = 1, >40 yrs = 0)           0.975   0.408    0.567   0.434
Nationality
  (Turkey)                              -1.675   0.603   -1.030   0.657
  (North Africa)                         0.190   0.194    0.579   0.307
  (Rest of Africa)                      -0.911   0.301    0.664   0.425
  (Surinam)                             -2.337   1.014   -1.720   1.050
  (Asia)                                -1.092   0.302   -1.056   0.448
  (America and Australia)                0.000     -      0.000     -
Reason (being illegal = 1, else = 0)     0.011   0.162    0.189   0.220



Table 7
Estimates $\hat{N}$ and 95% confidence intervals for $N$ obtained from fitting
different truncated Poisson regression models (first five models) and Zelterman
regression models (last five models). Model comparisons using the
likelihood-ratio test (LR) and AIC-criterion are also given. $\chi^2_{(1)}$ is
the Lagrange multiplier test testing for overdispersion in the Poisson
regression model

Model           AIC      LR      df   P*       chi2(1)   N        CI
Poisson regression
Null            1805.9   -       -    -        106.0     7080     6363-7797
G               1798.3   9.6     1    0.002    99.7      7319     6504-8134
G + A           1789.0   11.2    1    <0.001   93.7      7807     6637-8976
G + A + N       1712.9   86.1    5    <0.001   55.0      12,690   7186-18,194
G + A + N + R   1714.9   0.004   1    0.949    55.0      12,691   7185-18,198
Zelterman regression
Null            1191.4   -       -    -        -         9424     8084-10,765
G               1184.3   9.1     1    <0.003   -         9970     8327-11,614
G + A           1182.9   3.5     1    0.061    -         10,213   8416-12,009
G + A + N       1131.7   61.1    5    <0.001   -         16,129   9973-22,286
G + A + N + R   1133.0   0.7     1    0.403    -         16,188   9983-22,394

*P-value for likelihood-ratio test.
We now turn to the results for the Zelterman regression model, presented
in the bottom panel of Table 7. Here the model with Gender, Age and Nation
is also the best model in terms of model fit as well as AIC. If we compare
the population size estimates under the truncated Poisson regression model
with those under the Zelterman regression model, we find that, for models
with identical covariates, the population size estimates under the Zelterman
model are much larger. This suggests that the Zelterman model corrects
for the downward bias in the population size estimates from the truncated
Poisson regression model when overdispersion is present.

APPENDIX: VARIANCE ESTIMATION UNDER COVARIATES 

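We now provide details for computing a variance estimate of the generalized
Zelterman estimator (3.4), which we write as

$$\hat{N}_Z = \sum_{i=1}^{n} \frac{1}{w_i} = \sum_{i=1}^{N} \frac{\Delta_i}{w_i},$$

where $w_i = 1 - \exp(-2e^{\eta_i})$ and $\Delta_i$ is an indicator which is 1
(unit is sampled) with probability $w_i$ and 0 (unit is not sampled) with
probability $1 - w_i$.

We use the techniques of conditioning to develop a variance estimator of
(3.4) and follow the methodological development in van der Heijden et al.
(2003a). We have that [see Ross (1985), page 125]

(A.1)  $$\mathrm{Var}(\hat{N}_Z) = \mathrm{Var}_n[E(\hat{N}_Z \mid n)] + E_n[\mathrm{Var}(\hat{N}_Z \mid n)],$$

where moments inside the brackets are computed conditional upon $n$ and
moments outside the brackets refer to the marginal distribution of $n$.
Consider $E(\hat{N}_Z \mid n)$ and its estimate

$$\hat{E}(\hat{N}_Z \mid n) = \sum_{i=1}^{n} \frac{1}{w_i} = \sum_{i=1}^{N} \frac{\Delta_i}{w_i}.$$

Consequently,

$$\mathrm{Var}_n\left(\sum_{i=1}^{N} \frac{\Delta_i}{w_i}\right) = \sum_{i=1}^{N} \mathrm{Var}\left(\frac{\Delta_i}{w_i}\right) = \sum_{i=1}^{N} w_i(1 - w_i)/w_i^2 = \sum_{i=1}^{N} (1 - w_i)/w_i,$$

for which an unbiased estimator can be provided as

(A.2)  $$\widehat{\mathrm{Var}}_n\left(\sum_{i=1}^{N} \frac{\Delta_i}{w_i}\right) = \sum_{i=1}^{N} \Delta_i(1 - w_i)/w_i^2 = \sum_{i=1}^{n} (1 - w_i)/w_i^2.$$

We move on to consider the second term, $E_n[\mathrm{Var}(\hat{N}_Z \mid n)]$,
involved in (A.1). We write

(A.3)  $$\mathrm{Var}(\hat{N}_Z \mid n) = \mathrm{Var}\left(\sum_{i=1}^{N} \frac{\Delta_i}{w_i} \,\Big|\, \Delta_1, \ldots, \Delta_N\right),$$

so that

$$\mathrm{Var}(\hat{N}_Z \mid n) = \mathrm{Var}\left(\sum_{i=1}^{n} \frac{1}{w_i}\right).$$

Recall that $w_i = 1 - \exp(v_i)$ and $v_i = -2e^{\eta_i}$, so that

$$w_i = w_i(\hat{\beta}) = 1 - \exp(v_i) = 1 - \exp(-2e^{\hat{\eta}_i}) = 1 - \exp(-2e^{\hat{\beta}^T x_i}).$$

Consequently, $w_i(\hat{\beta})$ and $w_j(\hat{\beta})$ will not be independent
for $i \ne j$, since both depend on a common $\hat{\beta}$. An application of
the multivariate $\delta$-method as done by van der Heijden et al. (2003a)
provides

(A.4)  $$\left(\sum_{i=1}^{n} \frac{\nabla w_i(\hat{\beta})}{w_i^2}\right)^{T} \mathrm{Cov}(\hat{\beta}) \left(\sum_{i=1}^{n} \frac{\nabla w_i(\hat{\beta})}{w_i^2}\right),$$

where

(A.5)  $$\nabla w_i(\hat{\beta}) = -v_i \exp(v_i)\, x_i.$$

Summing (A.2) and (A.4) gives the full variance approximation (3.5) of
$\mathrm{Var}(\hat{N}_Z)$.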
SUPPLEMENTARY MATERIAL

Computer programmes and illegal immigrant data
(DOI: 10.1214/08-AOAS214SUPP; .zip).